Abstract

We tackle the change-point problem with data belonging to a general set. We propose
a penalty for choosing the number of change-points in the kernel algorithm of
Harchaoui and Cappé (2007). This penalty generalizes the one proposed for one-dimensional
signals by Lebarbier (2005). We prove it satisfies a non-asymptotic oracle inequality
by showing new concentration results in Hilbert spaces. Experiments on synthetic and real
data illustrate the accuracy of our method, showing it can detect changes in the whole
distribution, even when the mean and variance are constant. Our algorithm can also deal
with data of complex nature, such as the GIST descriptors which are commonly used for
video temporal segmentation.

Keywords: model selection, kernel methods, change-point problem, concentration inequality

1. Introduction 

A central topic in machine learning is finding the boundary between samples drawn from
different probability distributions. This goal is at the intersection of supervised learning
(such as binary classification, see Vapnik, 1998; Steinwart and Christmann, 2008) and
unsupervised learning (such as clustering, see von Luxburg, 2009). In the latter case, a
major theoretical issue arises when considering real-world problems, namely the model
selection issue, which corresponds to selecting the number of clusters (Ben-David et al., 2006;
von Luxburg, 2009). This issue is still an open problem.

In this paper, we consider a related topic, the change-point problem (Carlstein et al.,
1994). Let $X_1, \ldots, X_n$ be a sequence of independent random variables whose distribution
abruptly changes at some unknown instants (change-points). The change-point problem
consists of (i) estimating the change-point locations given their number, and (ii) determining
the number of change-points.

Given a positive semi-definite kernel $k$ and its associated feature map $\Phi$, our approach is
to solve the change-point regression problem via model selection with
$\Phi(X_1), \ldots, \Phi(X_n) \in \mathcal{H}$, some Hilbert space, by extending the work of
Lebarbier (2005) to the Hilbertian setting.

Unlike usual model selection approaches in the one-dimensional setting, which focus on
changes in the mean or the variance (Lavielle, 2005; Lebarbier, 2005), our approach can
capture changes in higher-order moments of probability distributions, using the machinery
of reproducing kernel Hilbert spaces. Another strength of the kernelized least-squares
algorithm we propose is that it can process time series with observations of any nature, as long as
some positive-definite kernel can be defined on their support, including data belonging to
some structured spaces such as the d-dimensional simplex. This is particularly appropriate
for temporal segmentation of video streams (see Section 6) for automatic summarization
of video archives. For multivariate signals in $\mathbb{R}^d$, other approaches were recently proposed,
mainly dedicated to biological applications. Picard et al. (2011) focus on changes in the
mean and make a Gaussian assumption on the signal. Bleakley and Vert (2011) propose a
fused-lasso-based algorithm to perform segmentation of the mean as well. Our approach
is more general since it is not limited to changes in the mean and does not rely on any
distributional assumption on the intra-segment distributions.

Our algorithm builds on the efficient algorithm of Harchaoui and Cappé (2007), but without
assuming that the number of change-points is known. This is a significant improvement
for practical applications. Furthermore, we prove theoretical guarantees for our data-driven
choice of the number of change-points, with a non-asymptotic oracle inequality (Theorem 1).

The main contributions of the paper are the following: (i) proposing a penalty
extending the one of Lebarbier (2005) to the kernel change-point problem, which allows a
data-driven choice of the number of change-points, (ii) proving it satisfies a non-asymptotic
oracle inequality (Theorem 1), by developing new concentration results in Hilbert spaces,
(iii) showing with experiments (Section 6) that the resulting algorithm is promising in terms of
applications, both for detecting changes in distribution that are not changes in the mean
or the variance, and for analyzing data of complex nature such as video streams.

2. Model selection for the change-point problem: one-dimensional data 

Let us start by summarizing how the change-point problem has been cast as a model
selection problem in the case of one-dimensional data (Lavielle, 2005; Lebarbier, 2005). Let
$0 \le t_1 < \cdots < t_n \le 1$ be deterministic instants of observation, $\mu^\star$ some measurable function
$[0,1] \to \mathcal{Y} = \mathbb{R}$ and

$$\forall i \in \{1, \ldots, n\}, \quad Y_i = \mu_i^\star + \varepsilon_i, \quad \text{where } \mu_i^\star = \mu^\star(t_i)$$

and $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed random variables with
$\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = \sigma^2 > 0$. The mean $\mu^\star(t_i)$ of the observations $Y_i$ is assumed piecewise constant
and the goal is to find change-points, that is, the locations of jumps in the mean. A classical
approach is to solve a least-squares regression problem by estimating $\mu^\star$ with a piecewise
constant function, with the number of change-points selected through a model selection
procedure (see Yao, 1988; Yao and Au, 1989; Lavielle and Moulines, 2000; Boysen et al.,
2009, and Section 4.3).

Since $\mu^\star$ is only evaluated at $t_1, \ldots, t_n$, it is considered as an element of $\mathcal{Y}^n$ with its
Euclidean structure given by $\|f - g\|^2 = \sum_{i=1}^n (f(t_i) - g(t_i))^2$ for every $f, g \in \mathcal{Y}^n$. We also
use the notation $Y = (Y_1, \ldots, Y_n)' \in \mathcal{Y}^n$. For every function $f : [0,1] \to \mathcal{Y}$, we define
respectively its quadratic and empirical risk

$$\mathcal{R}(f) := \frac{1}{n} \|f - \mu^\star\|^2 \quad \text{and} \quad \mathcal{R}_n(f) := \frac{1}{n} \|f - Y\|^2 . \qquad (1)$$

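Let $\mathcal{M}_n$ be the set of segmentations of $\{1, \ldots, n\}$, that is, the set of partitions of the form
$\{\{1, \ldots, k_1\}, \{k_1 + 1, \ldots, k_2\}, \ldots, \{k_{D-1} + 1, \ldots, n\}\}$ with $D \ge 1$ and
$1 \le k_1 < \cdots < k_{D-1} < n$. For every $m \in \mathcal{M}_n$, let $D_m = \mathrm{Card}(m)$ and $S_m$ be the set of
functions $\{t_1, \ldots, t_n\} \to \mathcal{Y}$ that are constant over $(t_i)_{i \in \lambda}$ for every segment $\lambda \in m$.
Then, the associated empirical risk minimizer, called regressogram, is defined by

$$\hat\mu_m \in \arg\min_{f \in S_m} \{\mathcal{R}_n(f)\}, \quad \text{so that} \quad \forall \lambda \in m, \ \forall i \in \lambda, \quad \hat\mu_m(t_i) = \frac{1}{\mathrm{Card}(\lambda)} \sum_{j \in \lambda} Y_j .$$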
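As an illustration, the regressogram and its empirical risk can be sketched in a few lines of Python. This is only a minimal sketch, not the authors' code: the function names `regressogram` and `empirical_risk` are hypothetical, and change-points are assumed to be given as the 1-based indices $k_1 < \cdots < k_{D-1}$ used above.

```python
import numpy as np

def regressogram(y, change_points):
    """Piecewise-constant least-squares fit of y on a given segmentation.

    change_points holds the sorted indices k_1 < ... < k_{D-1} (1-based), so the
    segments are {1..k_1}, {k_1+1..k_2}, ..., {k_{D-1}+1..n}.
    """
    y = np.asarray(y, dtype=float)
    bounds = [0, *change_points, len(y)]
    fit = np.empty_like(y)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        fit[start:stop] = y[start:stop].mean()  # average of Y_j over the segment
    return fit

def empirical_risk(y, fit):
    """R_n(f) = (1/n) * ||f - Y||^2, as in Eq. (1)."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((fit - y) ** 2) / len(y))
```

With the true segmentation of a noiseless piecewise-constant signal, the empirical risk is zero; with a single segment, the fit is the global mean.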

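The goal is to build a data-driven choice $\hat m \in \mathcal{M}_n$ such that the quadratic risk $\mathcal{R}(\hat\mu_{\hat m})$
is minimal. Following Birgé and Massart (2001) and Lebarbier (2005), this model selection
problem can be solved in a non-asymptotic manner by penalization:

$$\hat m \in \arg\min_{m \in \mathcal{M}_n} \{\mathcal{R}_n(\hat\mu_m) + \mathrm{pen}(m)\} , \qquad (2)$$

$$\text{where} \quad \mathrm{pen}(m) = \mathrm{pen}_{\mathrm{BM}}(m) := \frac{\sigma^2 D_m}{n} \left( c_1 \log\left(\frac{n}{D_m}\right) + c_2 \right) \quad \text{with } c_1, c_2 > 0 . \qquad (3)$$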

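If the noise variables $\varepsilon_i$ are Gaussian, Lebarbier (2005) proved that Eq. (2) leads to an
oracle inequality, that is, constants $c_1, c_2, K_1, K_2 > 0$ exist such that

$$\mathbb{E}\left[ \frac{1}{n} \|\hat\mu_{\hat m} - \mu^\star\|^2 \right] \le K_1 \inf_{m \in \mathcal{M}_n} \left\{ \frac{1}{n} \|\mu_m^\star - \mu^\star\|^2 + \mathrm{pen}_{\mathrm{BM}}(m) \right\} + K_2 \frac{\sigma^2}{n} , \qquad (4)$$

where $\mu_m^\star$ denotes the orthogonal projection of $\mu^\star$ onto $S_m$.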



The $\log(n)$ term in the penalty is the unavoidable price for ignoring change-point locations
(Birgé and Massart, 2007). Furthermore, extensive simulation experiments of Lebarbier
(2005) suggested the values $c_1 = 2$, $c_2 = 5$ and an efficient data-driven way of estimating
$\sigma^2$, called the slope heuristics.
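The penalized criterion of Eq. (2)-(3) is simple to evaluate once the best empirical risk is known for each number of segments. The sketch below assumes the noise level `sigma2` is known (the slope heuristics is not implemented) and uses the values $c_1 = 2$, $c_2 = 5$ quoted above; `pen_bm` and `select_dimension` are hypothetical helper names.

```python
import math

def pen_bm(n, D, sigma2, c1=2.0, c2=5.0):
    """Penalty of Eq. (3): sigma^2 * D / n * (c1 * log(n / D) + c2)."""
    return sigma2 * D / n * (c1 * math.log(n / D) + c2)

def select_dimension(risks, n, sigma2):
    """Return the number of segments D minimizing risk + penalty.

    risks[D - 1] is assumed to be the empirical risk of the best
    segmentation with exactly D segments.
    """
    crit = [r + pen_bm(n, D, sigma2) for D, r in enumerate(risks, start=1)]
    return 1 + min(range(len(crit)), key=crit.__getitem__)
```

For instance, with `risks = [1.0, 0.1, 0.09]`, `n = 100` and `sigma2 = 1.0`, the small gain in empirical risk from two to three segments does not compensate for the larger penalty, so two segments are selected.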

3. Kernel change-point problem 

Let us now describe how we generalize the approach of Section 2 to detecting changes in the 
probability distribution of the signals that belong to any set (not necessarily vector spaces). 

3.1 Problem 

Let $\mathcal{X}$ be some set and assume we observe independent random variables $X_1, \ldots, X_n \in \mathcal{X}$
at times $t_1, \ldots, t_n$ with a piecewise-constant probability distribution. The goal is to find
abrupt changes in the distribution of the time series $X_1, \ldots, X_n$, whereas classical
change-point estimation seeks changes in the first moments of the distribution such as the
mean or the variance (Korostelev and Korosteleva, 2011). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be some
positive definite kernel, $\mathcal{H} = \mathcal{H}_k$ the associated reproducing kernel Hilbert space, and
$\Phi : \mathcal{X} \to \mathcal{H}$ the canonical feature map defined by $\Phi(x) = k(x, \cdot)$ (see Schölkopf and Smola,
2001; Cucker and Zhou, 2007; Steinwart and Christmann, 2008, for a detailed presentation
of reproducing kernel Hilbert spaces). Then, for every $i \in \{1, \ldots, n\}$ we define

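$$Y_i = \Phi(X_i) \in \mathcal{H}$$

and $\mu_i^\star \in \mathcal{H}$ the mean element of the distribution of $X_i$, that is, the element of $\mathcal{H}$ such that
$\langle \mu_i^\star, g \rangle_{\mathcal{H}} = \mathbb{E}[g(X_i)]$ for every $g \in \mathcal{H}$.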

Following Sriperumbudur et al. (2008, 2010), we can exploit the strong connection between
the mean element $\mu_i^\star$ and the distribution of $X_i$. For instance, with translation-invariant
kernels satisfying a condition on their Fourier transform, equality of mean elements implies
equality of probability distributions (Sriperumbudur et al., 2008). So, we can focus on
detecting changes in the mean elements, assuming

$$\mu_1^\star = \cdots = \mu_{k_1^\star}^\star , \quad \mu_{k_1^\star + 1}^\star = \cdots = \mu_{k_2^\star}^\star , \quad \ldots , \quad \mu_{k_{D^\star - 1}^\star + 1}^\star = \cdots = \mu_n^\star$$

for some $1 \le k_1^\star < \cdots < k_{D^\star - 1}^\star < n$ (the true change-point indices). Moreover, if we
define $\varepsilon_i := Y_i - \mu_i^\star$ (whose "variance" $v_i = \mathbb{E}[\|\varepsilon_i\|_{\mathcal{H}}^2]$ is assumed finite for every
$i$), the approach of Section 2 formally extends from $\mathcal{Y} = \mathbb{R}$ to any Hilbert space $\mathcal{H}$. The
quadratic and empirical risks of $f \in \mathcal{H}^n$ are then defined by Eq. (1) again with
$\|f - g\|^2 = \sum_{i=1}^n \|f_i - g_i\|_{\mathcal{H}}^2$ for every $f, g \in \mathcal{H}^n$.
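In practice the feature map never has to be computed explicitly: expanding the squared norm shows that the least-squares cost of a segment $\lambda$ only depends on the Gram matrix, since $\sum_{i \in \lambda} \|Y_i - \bar Y_\lambda\|_{\mathcal{H}}^2 = \sum_{i \in \lambda} k(X_i, X_i) - \mathrm{Card}(\lambda)^{-1} \sum_{i, j \in \lambda} k(X_i, X_j)$, where $\bar Y_\lambda$ is the segment average. A minimal sketch, assuming the full Gram matrix `gram[i, j] = k(X_i, X_j)` fits in memory (function names are hypothetical):

```python
import numpy as np

def segment_cost(gram, start, stop):
    """Sum over i in [start, stop) of ||Phi(X_i) - segment mean||^2 in the RKHS."""
    block = gram[start:stop, start:stop]
    return float(np.trace(block) - block.sum() / (stop - start))

def kernel_empirical_risk(gram, change_points):
    """(1/n) * ||mu_hat_m - Y||^2 for the segmentation with the given change-points.

    change_points are the 1-based indices k_1 < ... < k_{D-1} of the text.
    """
    n = gram.shape[0]
    bounds = [0, *change_points, n]
    total = sum(segment_cost(gram, a, b) for a, b in zip(bounds[:-1], bounds[1:]))
    return total / n
```

With the linear kernel $k(x, y) = xy$ this reduces to the one-dimensional empirical risk of Section 2.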

The rest of the paper provides theoretical grounds for such an extension, showing in
particular that a penalty of the form (3) can still be used in the kernel setting with $\sigma^2$ replaced
by an upper bound on $\max_i v_i$. Since we aim at analyzing high-dimensional time series,
we will provide an analysis from the non-asymptotic point of view, by proving an oracle
inequality similar to Eq. (4). Note that such an extension is formally straightforward, but it
still requires solving some theoretical issues, since some key elements for proving Eq. (4) are
no longer valid in the Hilbertian setting. These issues are detailed in the next subsection.

3.2 Related work and theoretical challenges

A kernelized version of the approach of Section 2 was proposed by Harchaoui and Cappé
(2007), but assuming the number of change-points is known. Our algorithm is the same for
every fixed number of change-points, but goes one step further, since we do not assume the
number of changes is known a priori.

The penalty (3) and the proofs of Birgé and Massart (2001) and Lebarbier (2005) cannot
be extended directly to our case because (i) $Y_i = \Phi(X_i)$ are not real-valued but Hilbert-space-valued
random variables (with a possibly infinite-dimensional Hilbert space), (ii) Birgé
and Massart's approach heavily relies on the assumption that the noise $\varepsilon_i$ is Gaussian
with a constant variance, which is questionable in our Hilbertian setting. Indeed, if data
were Gaussian in the feature space, then any linear projection would follow a Gaussian
distribution, and kernel principal component analysis with usual kernels indicates this does
not hold for most real data sets.

The key step in Birgé and Massart's approach is to design a penalty $\mathrm{pen}(\cdot)$ such that

$$\forall m \in \mathcal{M}_n , \quad \mathrm{pen}(m) \ge \mathrm{pen}_{\mathrm{id}}(m) := \frac{1}{n} \|\hat\mu_m - \mu^\star\|^2 - \frac{1}{n} \|\hat\mu_m - Y\|^2 \qquad (5)$$

with high probability, without taking $\mathrm{pen}(m)$ larger than necessary. The quantity $\mathrm{pen}_{\mathrm{id}}(m)$
is called "ideal penalty" since using it in Eq. (2) would lead to minimizing the quadratic
risk. For proving Eq. (5), Birgé and Massart (2001) use the concentration properties of
functions of Gaussian variables.
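The claim that the ideal penalty leads to minimizing the quadratic risk follows directly from the definitions of the two risks in Eq. (1); the short computation below spells it out, using only notation already introduced.

```latex
\begin{align*}
\mathcal{R}_n(\hat\mu_m) + \mathrm{pen}_{\mathrm{id}}(m)
  &= \frac{1}{n}\|\hat\mu_m - Y\|^2
     + \frac{1}{n}\|\hat\mu_m - \mu^\star\|^2
     - \frac{1}{n}\|\hat\mu_m - Y\|^2 \\
  &= \frac{1}{n}\|\hat\mu_m - \mu^\star\|^2
   = \mathcal{R}(\hat\mu_m) .
\end{align*}
```

So minimizing the penalized criterion with the ideal penalty is the same as minimizing the quadratic risk over the collection of segmentations; the ideal penalty itself is unknown because it depends on $\mu^\star$.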

In our non-Gaussian Hilbertian setting, two concentration inequalities could be used
instead: (i) Pinelis-Sakhanenko's inequality (Pinelis and Sakhanenko, 1986), (ii) Talagrand's
inequality (see Bousquet, 2002). The first one cannot be used as such since it is not a
concentration but a deviation inequality, hence too loose for our purpose. The second one is
not accurate enough in our setting because it yields too large deviation terms, see Remark 7
in the appendix.

4. Oracle inequality for the kernel change-point problem 

This section shows how the penalty (3) can be extended to the Hilbertian setting of
Section 3, by proving an oracle inequality (Theorem 1).

4.1 Assumptions 

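Without a Gaussian homoscedastic assumption, we need to assume the following. Let us
recall $v_i := \mathbb{E}[\|Y_i - \mu_i^\star\|_{\mathcal{H}}^2] = \mathbb{E}[\|\varepsilon_i\|_{\mathcal{H}}^2]$, for every $i$.

Bounded data/kernel: $\exists M > 0$, $\sup_{1 \le i \le n} \|Y_i\|_{\mathcal{H}}^2 = \sup_{1 \le i \le n} k(X_i, X_i) \le M^2$ a.s. (Db)

Bounded variance: $\exists v_{\max} < +\infty$, $\max_{1 \le i \le n} v_i \le v_{\max}$ (Vmax)

Minimal variance: $\exists\, 0 < c_{\min} < +\infty$, $\min_{1 \le i \le n} v_i \ge \dfrac{M^2}{c_{\min}} =: v_{\min} > 0$. (Vmin)

Let us make a few remarks:

• (Db) implies (Vmax) with $v_{\max} = M^2$ since $v_i = \mathbb{E}[k(X_i, X_i)] - \|\mu_i^\star\|_{\mathcal{H}}^2 \le M^2$.

• if $k$ is translation invariant, that is, $k(x, x') = k(x - x')$ (e.g., the Gaussian and Laplace
kernels), then $v_i = k(0) - \|\mu_i^\star\|_{\mathcal{H}}^2$ so that (Vmax) and (Vmin) are assumptions on
$\|\mu_i^\star\|_{\mathcal{H}}$.

• if (Db) holds true, $v_i = \mathrm{tr}(\Sigma_i)$ where $\Sigma_i$ is the covariance operator of $\Phi(X_i)$.

• if $\mathcal{X} = \mathbb{R}^d$ and $k(x, y) = \langle x, y \rangle$, $v_i = \mathrm{tr}(\Sigma_i)$ where $\Sigma_i$ is the covariance matrix of $X_i$.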






Arlot, Celisse and Harchaoui 



4.2 Oracle inequality for change-point estimation 

The following theorem shows that an oracle inequality still holds for the kernel change-point
problem with a penalty of the form (3) where $\sigma^2$ is replaced by $v_{\max}$, up to numerical
constants.

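Theorem 1 Let us consider the kernel change-point problem described in Section 3.
Assume (Db), (Vmin) and (Vmax) hold true. Then, some numerical constant $L_1 > 0$ exists
such that for every $x > 0$, an event of probability at least $1 - e^{-x}$ exists on which, for any
$C \ge c_{\min}^2 L_1$ and any

$$\hat m \in \arg\min_{m \in \mathcal{M}_n} \{\mathcal{R}_n(\hat\mu_m) + \mathrm{pen}(m)\} \quad \text{with} \quad \mathrm{pen}(m) := \frac{C v_{\max} D_m}{n} \left[ 1 + \log\left(\frac{n}{D_m}\right) \right] , \qquad (6)$$

$$\mathcal{R}(\hat\mu_{\hat m}) \le 2 \inf_{m \in \mathcal{M}_n} \{\mathcal{R}(\hat\mu_m) + 2\, \mathrm{pen}(m)\} + \frac{[\ldots]}{n} . \qquad (7)$$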

A sketch of proof of Theorem 1 is given in Section 4.4 and a complete proof can be found
in Appendix B.5. If $\mathcal{X} = \mathbb{R}$, $k(x, y) = xy$, and $\forall i$, $v_i = v_{\max} > 0$, we recover (Theorem 1)
an oracle inequality similar to the one of Lebarbier (2005).

Note that (Db) is a classical assumption in the machine learning literature on kernels.
It holds true for instance with bounded kernels such as the Gaussian kernel (see
Section 5.2). In particular, it avoids assuming data are Gaussian as in Birgé and Massart
(2001). Assumptions (Vmin)-(Vmax) are a natural extension of the homoscedastic setting
of Birgé and Massart (2001), which would not be realistic in our Hilbertian setting. Note
that a fully heteroscedastic setting might be considered, for instance following the ideas of
Arlot and Celisse (2011), but with a more complex algorithm and no theoretical guarantees.
We choose (Vmin)-(Vmax) as a compromise between these two extremes.

The constant 2 in front of the oracle inequality (7) can be chosen arbitrarily close to
1, at the price of an increase of the numerical constant $L_1$ (which appears in the penalty
and in the remainder term through $C$). Besides, the constant $C$ suggested by the proof of
Theorem 1 is certainly not tight, as in all similar non-asymptotic oracle inequalities.

Finally, let us mention a byproduct of the proof of Theorem 1 which is detailed in
Appendix A: if some prior knowledge restricts the possible positions of change-points to a
subset of $\{t_1, \ldots, t_n\}$ with $O(\log n)$ elements, then a smaller penalty can be used instead
of (6), leading to an oracle inequality that is optimal in the homoscedastic case.



4.3 Discussion: change-point problem and oracle inequalities 

Let us discuss the relationship between minimizing the risk (proving an oracle inequality
like Eq. (4)) and the original change-point problem.

In the one-dimensional setting ($\mathcal{X} = \mathcal{H} = \mathbb{R}$), an oracle inequality shows that $\hat\mu_{\hat m}$ is
close to the best piecewise-constant estimator of $\mu^\star$ in terms of quadratic risk. So, we
can roughly expect that $\hat m$ detects all jumps of size $(\mu^\star(t_{i+1}) - \mu^\star(t_i))^2$ significantly larger
than the noise-level $\sigma^2 / N$, where $N$ is the number of observations available around the
jump. From the non-asymptotic point of view, it seems reasonable (and desirable) to aim
only at detecting jumps for which enough observations are available, which explains why
the procedure proposed by Lebarbier (2005) yields good results in terms of change-point
estimation.

In the kernelized version of this approach, a similar heuristic holds (as confirmed by our
simulation experiments, see Section 6). However, both the size of a jump, now measured by
$\|\mu_{i+1}^\star - \mu_i^\star\|_{\mathcal{H}}^2$, and the noise-level depend on the kernel $k$. So, $k$ should be chosen in order
to maximize the signal-to-noise ratio at every true change-point.

For instance, even when $\mathcal{X} = \mathbb{R}$, choosing an appropriate kernel $k$ makes it possible to detect
changes in the mean (with $k(x, y) = xy$), but also in other features of the distribution (for
instance with the Gaussian kernel, see the experiments of Section 6). Therefore, kernelizing
Birgé and Massart's approach can also be useful in the one-dimensional case when we do
not look for changes in the mean.



4.4 Sketch of the proof of Theorem 1 

The proof mostly follows the general approach of Birgé and Massart for proving an oracle
inequality, that is, we prove new concentration inequalities (Propositions 2 and 3) that are
needed to show the penalty defined by Eq. (6) satisfies Eq. (5) with a large probability.
Note that our proof actually leads to a more general model selection result (Theorem 8 in
the appendix) which admits corollaries of independent interest (see Appendix A).

4.4.1 Elementary computations

The proof starts by splitting the ideal penalty defined by Eq. (5) into two terms that
will be concentrated separately. All statements that are not proved here are detailed in
Appendix B.1.

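Recall that for every $m \in \mathcal{M}_n$, $S_m$ is the vector space of functions $\{t_1, \ldots, t_n\} \to \mathcal{H}$
that are constant over each $\lambda \in m$, and all functions $f : \{t_1, \ldots, t_n\} \to \mathcal{H}$ are written as
elements of $\mathcal{H}^n$ by denoting $f_i = f(t_i)$. In particular, $S_m$ is considered as a linear subspace
of $\mathcal{H}^n$. For $f, g \in \mathcal{H}^n$, let $\langle f, g \rangle := \sum_{i=1}^n \langle f_i, g_i \rangle_{\mathcal{H}}$ denote the canonical scalar product in
$\mathcal{H}^n$. The associated regressogram estimator is uniquely defined by

$$\hat\mu_m = \Pi_m Y \quad \text{where} \quad \forall g \in \mathcal{H}^n , \quad \Pi_m g := \arg\min_{f \in S_m} \{\|f - g\|^2\}$$

is the orthogonal projection of $g$ onto $S_m$. We also define $\mu_m^\star := \Pi_m \mu^\star$, and remark that

$$\forall g \in \mathcal{H}^n , \ \forall \lambda \in m , \ \forall i \in \lambda , \quad (\Pi_m g)_i = \frac{1}{\mathrm{Card}(\lambda)} \sum_{j \in \lambda} g_j . \qquad (8)$$

Then,

$$\mathrm{pen}_{\mathrm{id}}(m) = \frac{2}{n} \|\Pi_m \varepsilon\|^2 - \frac{2}{n} \langle (I - \Pi_m) \mu^\star , \varepsilon \rangle - \frac{1}{n} \|\varepsilon\|^2 . \qquad (9)$$

The term $n^{-1} \|\varepsilon\|^2$ does not depend on $m$ so it can be removed from the ideal penalty. The
expectations of the two other terms are given by

$$\mathbb{E}[\langle (I - \Pi_m) \mu^\star , \varepsilon \rangle] = 0 \quad \text{and} \quad \mathbb{E}[\|\Pi_m \varepsilon\|^2] = \sum_{\lambda \in m} v_\lambda \quad \text{where} \quad v_\lambda := \frac{1}{\mathrm{Card}(\lambda)} \sum_{i \in \lambda} v_i \qquad (10)$$

$$\text{so that} \quad \mathbb{E}\left[ \mathrm{pen}_{\mathrm{id}}(m) + \frac{1}{n} \|\varepsilon\|^2 \right] = \frac{2}{n} \sum_{\lambda \in m} v_\lambda . \qquad (11)$$

Then, the key results we need for showing the penalty (6) satisfies Eq. (5) are concentration
inequalities for $\langle (I - \Pi_m) \mu^\star , \varepsilon \rangle$ (Proposition 2) and for $\|\Pi_m \varepsilon\|^2$ (Proposition 3).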
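For completeness, the decomposition of the ideal penalty in Eq. (9) can be checked by hand. The computation below only uses $Y = \mu^\star + \varepsilon$, the fact that $\Pi_m$ is an orthogonal projection, and the definition (5) of $\mathrm{pen}_{\mathrm{id}}$.

```latex
\begin{align*}
\hat\mu_m - \mu^\star
  &= \Pi_m(\mu^\star + \varepsilon) - \mu^\star
   = -(I - \Pi_m)\mu^\star + \Pi_m \varepsilon , \\
\|\hat\mu_m - \mu^\star\|^2
  &= \|\mu^\star - \mu_m^\star\|^2 + \|\Pi_m \varepsilon\|^2
   \qquad \text{(orthogonal terms)} , \\
\|\hat\mu_m - Y\|^2
  &= \|(I - \Pi_m)(\mu^\star + \varepsilon)\|^2 \\
  &= \|\mu^\star - \mu_m^\star\|^2
     + 2 \langle (I - \Pi_m)\mu^\star , \varepsilon \rangle
     + \|\varepsilon\|^2 - \|\Pi_m \varepsilon\|^2 , \\
n \, \mathrm{pen}_{\mathrm{id}}(m)
  &= \|\hat\mu_m - \mu^\star\|^2 - \|\hat\mu_m - Y\|^2 \\
  &= 2 \|\Pi_m \varepsilon\|^2
     - 2 \langle (I - \Pi_m)\mu^\star , \varepsilon \rangle
     - \|\varepsilon\|^2 .
\end{align*}
```

The third line uses $\langle (I - \Pi_m)\mu^\star , (I - \Pi_m)\varepsilon \rangle = \langle (I - \Pi_m)\mu^\star , \varepsilon \rangle$ and $\|(I - \Pi_m)\varepsilon\|^2 = \|\varepsilon\|^2 - \|\Pi_m \varepsilon\|^2$.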

4.4.2 Two new concentration inequalities

First, for the linear term, we prove in Appendix B.2 the following result, mostly by applying 
Bernstein's inequality. 

Proposition 2 (Concentration of the linear term) Let $m \in \mathcal{M}_n$ and $\Pi_m$ be defined
by Eq. (8). If (Db) holds true, then for every $x > 0$, with probability at least $1 - 2e^{-x}$,

V0>o, |((/-n^)^*,.)|<0||^^-^^f + + . (12) 

Second, for the quadratic term, we prove in Appendix B.3 the following result, that relies 
on a combination of Bernstein and Pinelis-Sakhanenko inequalities. Note that directly using 
Talagrand's inequality (Bousquet, 2002) would lead to a less precise result in our setting, 
see Remark 7 in the appendix for details. 

Proposition 3 (Concentration of the quadratic term) Let $m \in \mathcal{M}_n$ and $\Pi_m$ be
defined by Eq. (8). If (Db), (Vmin) and (Vmax) hold true, then, for every $x > 0$, with
probability at least $1 - 2e^{-x}$,



V0G(O,1], ||n„,ef-E lln^ef < 0^ l|n„^ef + '""L (13) 

L J L J tr 

4.4.3 Conclusion of the proof 

The first step towards Eq. (5) is to get a uniform concentration inequality for the ideal
penalty from the combination of Eq. (9), Eq. (11), Proposition 2 and Proposition 3: for
every $x > 0$, an event $\Omega_m(x)$ of probability at least $1 - 4e^{-x}$ exists on which



ye G (0, 1] 



penid(m) + - ||ef - -Vwa 
n n ^-^ 

A 



40 9 

< — ||/i*-/I^r + r(x,0) , (14) 
n 



where r{x,9) := 213c^-^^v^i^x^ / {n6) . By definition (6) of m, for every m G A4n, 

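$$\frac{1}{n} \|\mu^\star - \hat\mu_{\hat m}\|^2 + [\mathrm{pen}(\hat m) - \mathrm{pen}_{\mathrm{id}}(\hat m)] \le \frac{1}{n} \|\mu^\star - \hat\mu_m\|^2 + [\mathrm{pen}(m) - \mathrm{pen}_{\mathrm{id}}(m)] . \qquad (15)$$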

Therefore, uniform bounds on the deviations of pen(m) — penj^(m) — are sufficient 

to get an oracle inequality. Let {xm)meMn ^ (0; +00)-'^" to be chosen later, and define 
the event O := flmGAi,, ^m{xm)- By the union bound, P(Q) > 1 - 4:Y^rn.(^Mn 
combining Eq. (14) and (15), for every penalty such that pen(m) > 2n~^ SAem ''^x^'i^i^m, G) 
for every m E Ain, on (7, for every 9 G (0, 1], 



1 -40 * ^ „2 . , 1 + 40 „ , ^ „2 

Ia* -Mmll < mf < ll/i -/imil +pen(m) 



n meM n 



- yZ vx + r{xm,0) > 
This proves a general oracle inequality (stated as Theorem 8 in the appendix), which implies
Theorem 1 by taking $x_m = D_m (\log(2) + 1 + \log(n / D_m)) + \log 4 + x$. Indeed, taking $\theta = 1/12$,
we get $2 \sum_{\lambda \in m} v_\lambda + n\, r(x_m, \theta) \le v_{\max} D_m C [1 + \log(n / D_m)] + C_3$ for some constants $C$, $C_3$, and
$\hat m$ remains unchanged by removing $C_3$ from the penalty. Finally, the probability of $\Omega^c$ is
upper bounded by

$$4 \sum_{1 \le D \le n} \mathrm{Card}\{m \in \mathcal{M}_n \,/\, D_m = D\}\, e^{-D(\log(2) + 1 + \log(n/D)) - \log 4 - x} \le e^{-x} \sum_{D \ge 1} 2^{-D} = e^{-x} .$$

5. Kernel multiple change-point algorithm 

This section summarizes the multiple change-point estimation algorithm suggested by
Theorem 1, and gives some examples of kernels for vectorial and non-vectorial data.

5.1 Algorithm 

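Input: observations $X_1, \ldots, X_n \in \mathcal{X}$, a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, some
constants $C > 1$, $D_{\max} < n$ and $v_{\max}$ such that (Vmax) holds true.

1. Define $\Phi(x) = k(x, \cdot) \in \mathcal{H}$, for every $x \in \mathcal{X}$, and $Y = (\Phi(X_i))_{1 \le i \le n} \in \mathcal{H}^n$.

2. Define $\hat\mu_m \in \mathcal{H}^n$ such that $\forall \lambda \in m$, $\forall i \in \lambda$, $(\hat\mu_m)_i = \frac{1}{\mathrm{Card}(\lambda)} \sum_{j \in \lambda} Y_j$, for every
$m \in \mathcal{M}_n$, where $\mathcal{M}_n$ denotes the set of segmentations of $\{1, \ldots, n\}$.

3. Compute $\hat m_D \in \arg\min_{m \in \mathcal{M}_n,\, D_m = D} \{n^{-1} \|Y - \hat\mu_m\|^2\}$, for every $D \in \{1, \ldots, D_{\max}\}$.

4. Compute $\hat D \in \arg\min_{1 \le D \le D_{\max}} \left\{ \frac{1}{n} \|Y - \hat\mu_{\hat m_D}\|^2 + \frac{C v_{\max} D}{n} \left( \log\left(\frac{n}{D}\right) + 1 \right) \right\}$.

Output: segmentation $\hat m = \hat m_{\hat D}$.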
The above algorithm can be seen as a kernelized version of the one proposed by Lebarbier 
(2005). Our main contributions are the theoretical guarantees of Section 4 and the experi- 
ments of Section 6. 

Computational complexity: For each fixed $D$, step 3 is the dynamic programming
algorithm proposed by Harchaoui and Cappé (2007); see also (Kay, 1993). Computing
$(\hat m_D)_{1 \le D \le D_{\max}}$ requires at most $O(D_{\max} n^2)$ times the cost of computing any $k(X_i, X_j)$.

Setting $v_{\max}$: If (Db) holds true, one can always take $v_{\max} = M^2$ but this bound might
be loose. In most real-world applications, it is realistic to assume that $0 < \underline{t} < \overline{t} < 1$ are
known such that all the change-points belong to $[\underline{t}, \overline{t}]$, that is, $\underline{t} < t_{k_1^\star}$ and $t_{k_{D^\star - 1}^\star} < \overline{t}$.
Such "edge instants" can usually be inferred from real-world knowledge, as in Section 6.2.
Then, the signal is stationary over $[0, \underline{t}]$ and over $[\overline{t}, 1]$, and we propose to estimate $v_{\max}$ by

$$\hat v_{\max} = \max\left\{ \mathrm{tr}\big(\hat\Sigma_{0:\underline{t}}\big) , \mathrm{tr}\big(\hat\Sigma_{\overline{t}:1}\big) \right\} \qquad (16)$$

where $\hat\Sigma_{a:b}$ is the empirical covariance estimator of $(\Phi(X_i))_{a \le t_i \le b}$. We shall use this estimate
in the experiments of Section 6.
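The estimator (16) can be computed from the Gram matrix alone, since the trace of the empirical covariance of $N$ feature vectors equals $N^{-1} \sum_i k(X_i, X_i) - N^{-2} \sum_{i,j} k(X_i, X_j)$. The sketch below assumes the $1/N$ normalisation of the empirical covariance (the text does not specify it) and hypothetical function names.

```python
import numpy as np

def trace_cov(gram_block):
    """Trace of the empirical covariance of Phi(X_i) over one block.

    Uses the 1/N normalisation, which is an assumption of this sketch.
    """
    N = gram_block.shape[0]
    return float(np.trace(gram_block) / N - gram_block.sum() / N**2)

def estimate_v_max(gram, times, t_low, t_high):
    """Eq. (16): max of the two edge traces, over t_i <= t_low and t_i >= t_high."""
    times = np.asarray(times, dtype=float)
    left = np.flatnonzero(times <= t_low)
    right = np.flatnonzero(times >= t_high)
    return max(trace_cov(gram[np.ix_(left, left)]),
               trace_cov(gram[np.ix_(right, right)]))
```

Both edge windows must contain at least one observation; with very few points per window the estimate is of course noisy.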






5.2 Examples of kernels 

The algorithm of Section 5.1 can be used with various sets $\mathcal{X}$ (not necessarily vector spaces),
and with several different kernels $k$ for a given $\mathcal{X}$. In particular, our approach is flexible
with respect to the nature of data. It can handle any type of data as long as a positive-definite
kernel similarity measure for such data is available. Instances of such data are simplicial
data (histograms), texts, trees, among others (Shawe-Taylor and Cristianini, 2004). Some
classical kernel choices are detailed below.

• when $\mathcal{X} = \mathbb{R}$, $k(x, y) = xy$ and we recover the algorithm by Lebarbier (2005) since
$\|\Phi(x) - \Phi(x')\|^2 = (x - x')^2$.

• when $\mathcal{X} = \mathbb{R}^d$, $k(x, y) = \langle x, y \rangle_{\mathbb{R}^d}$ yields its natural extension since $\|\Phi(x) - \Phi(x')\|^2 =
\sum_{i=1}^d (x_i - x_i')^2$, the squared Euclidean norm in $\mathbb{R}^d$.

• when $\mathcal{X} = \mathbb{R}^d$, other choices are the Gaussian kernel with bandwidth $h > 0$, $k_h^G(x, y) =
\exp(-\|x - y\|^2 / (2h^2))$, and the Laplace kernel with bandwidth $h > 0$, $k_h^L(x, y) =
\exp(-\|x - y\| / (2h^2))$; see Section 6 for experimental results with such kernels.

• when $\mathcal{X} = \{(p_1, \ldots, p_d) \in [0, 1]^d \text{ such that } p_1 + \cdots + p_d = 1\}$, the set of d-dimensional
histograms, the intersection kernel is $k(p, q) = \sum_{i=1}^d \min(p_i, q_i)$ (Hein and Bousquet,
2004; Maji et al., 2008); see Section 6 for experimental results with such a kernel.
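The kernels listed above are straightforward to write down. The sketch below is only illustrative: function names are hypothetical, the Laplace kernel keeps the $2h^2$ normalisation written in the text, and `gram_matrix` is a naive quadratic-time helper.

```python
import numpy as np

def _diff(x, y):
    return np.atleast_1d(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

def linear_kernel(x, y):
    """k(x, y) = <x, y>; with scalars this is k(x, y) = x * y."""
    return float(np.dot(x, y))

def gaussian_kernel(x, y, h):
    """k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    d = _diff(x, y)
    return float(np.exp(-np.dot(d, d) / (2 * h**2)))

def laplace_kernel(x, y, h):
    """k(x, y) = exp(-||x - y|| / (2 h^2)), normalised as in the text."""
    return float(np.exp(-np.linalg.norm(_diff(x, y)) / (2 * h**2)))

def intersection_kernel(p, q):
    """k(p, q) = sum_i min(p_i, q_i), for histograms p and q."""
    return float(np.minimum(p, q).sum())

def gram_matrix(points, kernel):
    """Gram matrix gram[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])
```

A kernel with a bandwidth can be passed to `gram_matrix` through a closure, for example `lambda a, b: gaussian_kernel(a, b, h=1.0)`.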

6. Simulation experiments 

We now present experimental results on the performance of our approach, respectively on 
synthetic data and on real data. 

6.1 Synthetic data 

First, we study the statistical behaviour of our approach for estimating the change-point
locations of synthetic time series with $\mathcal{X} = \mathbb{R}$ and $n = 1000$. The 9 change-point locations
are fixed and chosen so the segments have various lengths, see the middle part of Figure 1.
The intra-segment distributions are chosen randomly among the first ten probability
distributions considered by Marron and Wand (1992) with common mean and variance. Since
they only differ by their higher-order moments, standard approaches aiming at detecting
changes in the mean or in variance would fail in such a situation.

Figure 1: Synthetic data. Left: Expectations of the model selection criterion, empirical
risk, and quadratic risk as a function of the number of candidate change-points.
Middle: Pictorial representation of the frequency of detection of a change-point
at each position; blue lines correspond to the true change-points. Right:
Distribution of the estimated number of change-points $\hat D - 1$.

                                     Synthetic      Audio          Video
$\max \min |\hat t_i - t_j^\star|$   0.049±0.003    0.061±0.005    0.081±0.007
                                     0.053±0.006    0.079±0.006    0.093±0.007

Table 1: Average Hausdorff distances between the estimated and true segmentation in the
three experiments.

We take the Gaussian RBF kernel $k(x, y) = \exp(-(x - y)^2 / (2h^2))$ with $2h^2$ among 0.1,
1 and $\mathrm{median}_{1 \le i, j \le n}\{\|X_i - X_j\|^2\}$, the latter being a classical heuristic in kernel-based
methods. We use the strategy presented in Section 5 and estimate $v_{\max}$ with $\hat v_{\max} :=
\max\{\mathrm{tr}(\hat\Sigma_{0:\underline{t}}), \mathrm{tr}(\hat\Sigma_{\overline{t}:1})\}$ where $\underline{t} = 0.05$, $\overline{t} = 0.95$ and $t_i = i/n$ for all $i$. In preliminary
experiments, we tested other strategies such as kernel-based counterparts of estimates of
the [...] the bandwidth still lead to satisfactory results. A more detailed account of the performance
of our algorithm is given in Figure 1, where the bandwidth is chosen with the classical
heuristic. The left part of Figure 1 shows our criterion is minimal (in expectation) for
the same number of change-points as the quadratic risk, which equals the true number of
change-points. In Figures 1-2, one can notice the empirical risk increases for large $D$. This
phenomenon is due to the fact that we use heuristic rules for making the computation of $\hat m_D$
faster, which do not always give the exact minimizer of the empirical risk for large values of
$D$. Nevertheless, our algorithm is still accurate enough around the true and selected number
of change-points. The right part of Figure 1 confirms the estimated number of breakpoints
$\hat D - 1$ is distributed around their true number. The middle part of Figure 1 represents
the frequency of detection of a change-point at each location; for representation purposes,
we fitted a mixture of Gaussians centered around the true change-points, so their standard
deviations represent the accuracy of estimation of each true change-point. In particular,
we observe the change-points are rather accurately detected, and that shorter segments are
harder to detect accurately.
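The classical bandwidth heuristic used in this section sets $2h^2$ to the median of the squared pairwise distances between observations. A minimal sketch follows; the function name is hypothetical, and taking the median over pairs $i < j$ only (excluding the zero diagonal) is an assumption of this sketch.

```python
import numpy as np

def median_heuristic(x):
    """Return 2 * h^2 = median of ||X_i - X_j||^2 over pairs i < j."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)                     # treat scalars as 1-d vectors
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    iu = np.triu_indices(len(x), k=1)             # strictly upper triangle
    return float(np.median(sq[iu]))
```

This computes all pairwise distances, so for long series one would typically apply it to a random subsample.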

Table 1 provides results on the accuracy in estimating the change-point location in
terms of Hausdorff distance between the set of estimated change-points $\{\hat t_1, \ldots, \hat t_{\hat D - 1}\}$ and
the set of true change-points $\{t_1^\star, \ldots, t_{D^\star - 1}^\star\}$, a common distance measure in the literature
(Boysen et al., 2009).
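For reference, the Hausdorff distance between two finite sets of change-point locations is the larger of the two directed max-min distances. A minimal sketch, assuming both sets are non-empty (the function name is hypothetical):

```python
def hausdorff(estimated, true):
    """Hausdorff distance between two non-empty finite sets of real numbers."""
    d_est_to_true = max(min(abs(a - b) for b in true) for a in estimated)
    d_true_to_est = max(min(abs(a - b) for a in estimated) for b in true)
    return max(d_est_to_true, d_true_to_est)
```

The first directed distance is large when a spurious change-point is detected far from any true one; the second is large when a true change-point is missed.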
Figure 2: Real data experiment. Left and middle: our criterion and the empirical risk
as a function of $D$, for one particular chunk (left: audio stream; middle: video
stream). Right: Distribution of the estimated number of change-points $\hat D - 1$
(video).



6.2 Real data: audio and video temporal segmentation 

We now tackle the problem of temporal segmentation of the audio (resp. video) stream
of entertainment TV shows into semantically homogeneous segments: trailer, audience
applause, interview, music performance, and so on. We considered 50 chunks of audio (resp.
video) streams, each delimited by two annotated changes at its borders. For each
chunk, the true number of segments (given by manual annotation of semantically
homogeneous parts of the TV show) is 5, and our goal is to recover these segments automatically,
without knowing their number. Each chunk lasts at most 30 minutes of the TV show,
and 20 minutes on average.

Audio part We extracted every 10 ms the first 12 Mel Frequency Cepstral Coefficients
(MFCC) of the audio track (Rabiner and Schafer, 2007). MFCCs are commonly used
features in speech recognition and audio processing. They provide a representation of the
short-term power spectrum of a sound. We subsampled the signal when necessary to reduce
the computing time of the dynamic programming part of our method. We used the
Gaussian RBF kernel with a bandwidth automatically set using the classical heuristic rule as in
Section 6.1. We present the performance of our approach on one particular audio chunk in
Figure 2 (left). On this example, our approach selects the correct number of change-points
in the time series on average (see Figure 3 in Appendix D) with a good accuracy (Table 1).

Video part We extracted 1024-dimensional GIST descriptors for each frame of the video
track (Oliva and Torralba, 2001). GIST descriptors aggregate perceptual dimensions
(naturalness, openness, roughness, expansion, ruggedness) that represent the dominant
spatial structure of a scene. Again, we subsampled the signal when necessary to reduce
the computing time. We used the so-called intersection kernel (Hein and Bousquet, 2004;
Maji et al., 2008), which is appropriate for data belonging to d-dimensional simplices such
as histogram-like GIST descriptors. Note that an attractive feature of the intersection
kernel is that there is no hyperparameter (bandwidth) to tune. We present the performance
of our approach on a particular video chunk in Figure 2 (middle and right). Here, the
good performance of our approach is less clear, as the average of the dimensions selected
by our approach is 8.85 instead of 5. There are two possible explanations: (i) our estimate of $v_{\max}$
is too rough, and over-segmentation is favored in the subsequent criterion, (ii) the GIST
descriptors are too loose descriptors for this task.

7. Conclusion 

We have proposed a penalty generalizing the one of Lebarbier (2005) to the kernel change- 
point problem, and showed it satisfies a non-asymptotic oracle inequality. Such an extension 
significantly broadens the possible applications of this penalization approach to the change- 
point problem. The theoretical tools developed for our method could also be used in other 
settings, such as clustering in general Hilbert spaces. As a future direction, we would like 
to investigate the kernel selection problem, which remains a major issue as in most machine 
learning problems (see the discussion of Section 4.3). 

Appendix A. Oracle inequality with a small collection of segmentations 

Let us state a result which slightly differs from the primary goal of the paper (Theorem 1) 
but is a byproduct and can be of independent interest. Assume a subset $\mathcal{M}'_n$ of the set $\mathcal{M}_n$ 
of all segmentations of $\{1, \ldots, n\}$ is given such that 

$$\exists \alpha_{\mathcal{M}'} > 0, \quad \operatorname{Card}(\mathcal{M}'_n) \le n^{\alpha_{\mathcal{M}'}} . \tag{Pol}$$

In particular, this setting corresponds to the situation where some prior knowledge restricts 
possible change-point locations to a subset of $\{t_1, \ldots, t_n\}$ with $O(\log n)$ elements. Let us 
now consider the model selection procedure defined by 

$$\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}'_n} \left\{ \frac{1}{n} \left\| \widehat{\mu}_m - Y \right\|^2 + \operatorname{pen}(m) \right\} . \tag{17}$$

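Then, Mallows' heuristics (Mallows, 1973) states that $\operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ leads to an 
oracle inequality. Making this informal argument rigorous, we obtain the following theorem, 
where assumption (Vmin) is replaced by a weaker assumption: 

$$\exists\, 0 < c_{\min} < +\infty, \quad \forall m \in \mathcal{M}'_n, \ \forall \lambda \in m, \quad v_\lambda := \frac{1}{\operatorname{Card}(\lambda)} \sum_{i \in \lambda} v_i \ge \frac{M^2}{c_{\min}} = v_{\min} . \tag{Vmin'}$$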



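Theorem 4 If (Db), (Vmin'), (Vmax) hold true and if $\widehat{m}$ satisfies Eq. (17) with 

$$\forall m \in \mathcal{M}'_n, \quad \operatorname{pen}(m) \ge \frac{2}{n} \sum_{\lambda \in m} v_\lambda , \tag{18}$$

then, for every $x \ge 0$, an event of probability at least $1 - e^{-x}$ exists on which, for every 
$\theta \in (0, 1/8)$, 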

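$$\mathcal{R}(\widehat{\mu}_{\widehat{m}}) \le \frac{1}{1 - 4\theta} \inf_{m \in \mathcal{M}'_n} \left\{ (1 + 4\theta) \mathcal{R}(\widehat{\mu}_m) + \operatorname{pen}(m) - \frac{2}{n} \sum_{\lambda \in m} v_\lambda \right\} + \left[ x + \log(4 \operatorname{Card}(\mathcal{M}'_n)) \right] \frac{426\, c_{\min}^2 v_{\max}}{\theta (1 - 4\theta) n} .$$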

Theorem 4 is proved in Section B.6. Note that Theorem 4 holds for every $\mathcal{M}'_n$ (even 
$\mathcal{M}'_n = \mathcal{M}_n$), but the remainder term is only reasonably small if assumption (Pol) holds true. 

Since (Vmax) implies $2 \sum_{\lambda \in m} v_\lambda \le 2 D_m v_{\max}$, we get a formula for the penalty if an 
upper bound on $v_{\max}$ is known or can be estimated. The corresponding procedure satisfies 
the following oracle inequality. 

Corollary 5 In the framework of Theorem 4, let us assume some constant $A > 0$ exists 
such that 

$$\operatorname{pen}(m) = \frac{2 A D_m}{n} \ge \frac{2 v_{\max} D_m}{n} . \tag{19}$$

Then, for every $x \ge 0$, an event of probability at least $1 - e^{-x}$ exists on which, for every 
$\theta \in (0, 1/8)$, 

n{jiff,)<^(i + -^) inf {7^(/^„)} 

Vmin V log(n) J m£M'„ 

+ 426AcLJ°"^"^^" + ^°"^''^^^'^-^"^^^ . (20) 

n 

Corollary 5 is proved in Section B.7. If (Db) holds true for some known constant $M$ (for 
instance, $M = 1$ with the Gaussian and Laplace kernels), one can take $A = M^2 \ge v_{\max}$ in 
the penalty (19). 
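
As a rough illustration of the selection procedure (17), the Python sketch below evaluates the penalized criterion directly from a Gram matrix, using the identity $\|Y - \widehat{\mu}_m\|^2 = \sum_i k(X_i, X_i) - \sum_{\lambda \in m} n_\lambda^{-1} \sum_{i,j \in \lambda} k(X_i, X_j)$. It reads the penalty (19) as $2 A D_m / n$. The function names, the toy signal, the candidate collection and the choice $A = 1$ (that is, $A = M^2$ for a Gaussian kernel) are assumptions made for this example only.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on real numbers; k(x, x) = 1, so (Db) holds with M = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def empirical_risk(K, boundaries):
    """(1/n) * ||Y - mu_hat_m||^2 computed from the Gram matrix K.

    `boundaries` lists segment end indices (exclusive), e.g. [3, 6] for
    the segments {0, 1, 2} and {3, 4, 5} of n = 6 points.
    """
    n = len(K)
    total = sum(K[i][i] for i in range(n))
    start = 0
    for end in boundaries:
        block = sum(K[i][j] for i in range(start, end) for j in range(start, end))
        total -= block / (end - start)
        start = end
    return total / n


def select_segmentation(K, candidates, A):
    """Return the candidate minimizing empirical risk + 2 * A * D_m / n."""
    n = len(K)

    def criterion(boundaries):
        return empirical_risk(K, boundaries) + 2.0 * A * len(boundaries) / n

    return min(candidates, key=criterion)


# Toy signal with one change after the third point (hypothetical data).
signal = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
K = [[gaussian_kernel(x, y) for y in signal] for x in signal]
candidates = [[6], [3, 6], [2, 4, 6]]   # a small collection, in the spirit of (Pol)
best = select_segmentation(K, candidates, A=1.0)
```

Only kernel evaluations are needed, so the same code would apply to any positive semi-definite kernel; the exhaustive search over `candidates` is only reasonable for a small collection.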

If $A = v_{\max}$, one recovers the leading constant $v_{\max}/v_{\min}$ in front of the oracle inequality, 
which is the price for ignoring the variations of noise along the signal. In particular, when 

$$\forall 1 \le i \le n, \quad v_i = v_{\max} > 0 , \tag{Vc}$$

(Vmin') holds true with $v_{\min} = v_{\max}$ and the leading constant in the oracle inequality (20) 
is one at first order. If assumption (Pol) holds true, the remainder term is of order at most 
$(\log(n))^2/n$ so that (20) is an "optimal" oracle inequality similar to the one proved when 
$\mathcal{H} = \mathbb{R}$ by Birge and Massart (2007) in the Gaussian regression setting. 

The reason why penalties in Eq. (6) and (19) are different is that Eq. (19) only yields 
a good penalty when (Pol) holds true, so not for change-point detection as in Theorem 1. 
Indeed, Eq. (5) holds for $\operatorname{pen}(m) \approx \mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$ when the collection of models is "small" 
(that is, if (Pol) holds true), but not with a collection as large as $\mathcal{M}_n$. Eq. (6) shows which 
additional terms are necessary to get Eq. (5) with a "large" collection of models like $\mathcal{M}_n$. 

Appendix B. Proofs 

This section gathers the proofs of all results stated previously in the paper. 

B.1 Proof of the statements of Section 4.4.1 

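Proof of Eq. (8) Let $f \in S_m$ and $g \in \mathcal{H}^n$. For every $\lambda \in m$, let us define $f_\lambda$ as the common value of 
$(f_i)_{i \in \lambda}$, and 

$$\bar{g}_\lambda := \frac{1}{\operatorname{Card}(\lambda)} \sum_{i \in \lambda} g_i .$$

Then, 

$$\begin{aligned}
\|f - g\|^2 &= \sum_{\lambda \in m} \sum_{i \in \lambda} \|f_\lambda - g_i\|_{\mathcal{H}}^2 \\
&= \sum_{\lambda \in m} \sum_{i \in \lambda} \left[ \|f_\lambda - \bar{g}_\lambda\|_{\mathcal{H}}^2 + \|\bar{g}_\lambda - g_i\|_{\mathcal{H}}^2 + 2 \langle f_\lambda - \bar{g}_\lambda, \bar{g}_\lambda - g_i \rangle_{\mathcal{H}} \right] \\
&= \sum_{\lambda \in m} \left[ \operatorname{Card}(\lambda) \|f_\lambda - \bar{g}_\lambda\|_{\mathcal{H}}^2 \right] + \sum_{\lambda \in m} \sum_{i \in \lambda} \|g_i - \bar{g}_\lambda\|_{\mathcal{H}}^2 + 2 \sum_{\lambda \in m} \Big\langle f_\lambda - \bar{g}_\lambda, \sum_{i \in \lambda} (\bar{g}_\lambda - g_i) \Big\rangle_{\mathcal{H}} \\
&= \sum_{\lambda \in m} \left[ \operatorname{Card}(\lambda) \|f_\lambda - \bar{g}_\lambda\|_{\mathcal{H}}^2 \right] + \sum_{\lambda \in m} \sum_{i \in \lambda} \|g_i - \bar{g}_\lambda\|_{\mathcal{H}}^2 .
\end{aligned}$$

Then, $\|f - g\|^2$ is minimal if and only if $f_\lambda = \bar{g}_\lambda$ for every $\lambda \in m$. 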

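For proving Eq. (9), we compute the empirical and quadratic risks of $\widehat{\mu}_m$: 

$$\frac{1}{n} \|Y - \widehat{\mu}_m\|^2 = \frac{1}{n} \|\mu^* - \mu^*_m\|^2 + \frac{1}{n} \|\varepsilon\|^2 - \frac{1}{n} \|\Pi_m \varepsilon\|^2 + \frac{2}{n} \langle (I - \Pi_m) \mu^*, \varepsilon \rangle \tag{21}$$

$$\frac{1}{n} \|\mu^* - \widehat{\mu}_m\|^2 = \frac{1}{n} \|\mu^* - \mu^*_m\|^2 + \frac{1}{n} \|\Pi_m \varepsilon\|^2 . \tag{22}$$

The term $n^{-1} \|\mu^* - \mu^*_m\|^2$ is called approximation error, or bias. 

Proof of Eq. (21) 

$$\begin{aligned}
\|Y - \widehat{\mu}_m\|^2 = \|Y - \Pi_m Y\|^2 &= \|\mu^* - \Pi_m \mu^*\|^2 + \|\varepsilon - \Pi_m \varepsilon\|^2 + 2 \langle \mu^* - \Pi_m \mu^*, \varepsilon - \Pi_m \varepsilon \rangle \\
&= \|\mu^* - \mu^*_m\|^2 + \|\varepsilon\|^2 - \|\Pi_m \varepsilon\|^2 + 2 \langle (I - \Pi_m) \mu^*, \varepsilon \rangle
\end{aligned}$$

since $\Pi_m$ is an orthogonal projection. ■ 

Proof of Eq. (22) 

$$\begin{aligned}
\|\mu^* - \widehat{\mu}_m\|^2 &= \|\mu^* - \mu^*_m\|^2 - 2 \langle \mu^* - \mu^*_m, \Pi_m \varepsilon \rangle + \|\Pi_m \varepsilon\|^2 \\
&= \|\mu^* - \mu^*_m\|^2 + \|\Pi_m \varepsilon\|^2
\end{aligned}$$

since $\Pi_m$ is an orthogonal projection. ■ 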

Proof of Eq. (9) Eq. (9) follows from Eq. (21)-(22) and from the definition (5) of the 
ideal penalty. ■ 
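For proving Eq. (10), we will use that 

$$\forall i, j \in \{1, \ldots, n\}, \quad \mathbb{E}[\langle \varepsilon_i, \varepsilon_j \rangle_{\mathcal{H}}] = \mathbf{1}_{i = j} \left( \mathbb{E}[k(X_i, X_i)] - \|\mu^*_i\|_{\mathcal{H}}^2 \right) = \mathbf{1}_{i = j}\, v_i . \tag{23}$$

Proof of Eq. (23) For every $i, j \in \{1, \ldots, n\}$, 

$$\begin{aligned}
\mathbb{E}[\langle \varepsilon_i, \varepsilon_j \rangle_{\mathcal{H}}] &= \mathbb{E}[\langle \Phi(X_i), \Phi(X_j) \rangle_{\mathcal{H}}] - \mathbb{E}[\langle \mu^*_i, \Phi(X_j) \rangle_{\mathcal{H}}] - \mathbb{E}[\langle \Phi(X_i), \mu^*_j \rangle_{\mathcal{H}}] + \langle \mu^*_i, \mu^*_j \rangle_{\mathcal{H}} \\
&= \mathbb{E}[\langle \Phi(X_i), \Phi(X_j) \rangle_{\mathcal{H}}] - \langle \mu^*_i, \mu^*_j \rangle_{\mathcal{H}} \\
&= \mathbf{1}_{i = j} \left( \mathbb{E}\left[ \|\Phi(X_i)\|_{\mathcal{H}}^2 \right] - \|\mu^*_i\|_{\mathcal{H}}^2 \right) ,
\end{aligned}$$

where the last equality uses the independence of $X_i$ and $X_j$ when $i \ne j$. ■ 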

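Proof of Eq. (10) The first equality comes from the fact that $\mathbb{E}[\langle f, \varepsilon \rangle] = 0$ for every 
(deterministic) $f \in \mathcal{H}^n$ by definition of $\varepsilon = Y - \mu^*$. For the second equality, Eq. (8) implies 

$$\|\Pi_m \varepsilon\|^2 = \sum_{\lambda \in m} \sum_{i \in \lambda} \Big\| \frac{1}{n_\lambda} \sum_{j \in \lambda} \varepsilon_j \Big\|_{\mathcal{H}}^2 = \sum_{\lambda \in m} \frac{1}{n_\lambda} \Big\| \sum_{i \in \lambda} \varepsilon_i \Big\|_{\mathcal{H}}^2 = \sum_{\lambda \in m} \frac{1}{n_\lambda} \sum_{i, j \in \lambda} \langle \varepsilon_i, \varepsilon_j \rangle_{\mathcal{H}} \tag{24}$$

where $\forall \lambda \in m$, $n_\lambda := \operatorname{Card}(\lambda)$. Now, using Eq. (23), we get 

$$\mathbb{E}\left[ \|\Pi_m \varepsilon\|^2 \right] = \sum_{\lambda \in m} \frac{1}{n_\lambda} \sum_{i \in \lambda} v_i = \sum_{\lambda \in m} v_\lambda .$$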

Eq. (11) follows from Eq. (9)-(10). 
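
The two facts used above, namely that the projection of Eq. (8) averages over each segment and that the expected squared norm of the projected noise is the sum of the segment-averaged variances, can be checked numerically. The Python sketch below does so in the simplest case $\mathcal{H} = \mathbb{R}$ (linear kernel) with Gaussian noise; the variances, the segmentation and the number of Monte Carlo trials are hypothetical choices made for the example.

```python
import random


def project(values, boundaries):
    """Orthogonal projection onto piecewise-constant sequences: segment means."""
    projected, start = [], 0
    for end in boundaries:
        mean = sum(values[start:end]) / (end - start)
        projected.extend([mean] * (end - start))
        start = end
    return projected


def expected_projected_noise(variances, boundaries):
    """Sum over segments of the average variance v_lambda."""
    total, start = 0.0, 0
    for end in boundaries:
        total += sum(variances[start:end]) / (end - start)
        start = end
    return total


rng = random.Random(0)
variances = [1.0, 1.0, 4.0, 4.0, 4.0, 0.25]   # v_i, hypothetical values
boundaries = [2, 6]                            # segments {0, 1} and {2, ..., 5}

trials = 20000
average = 0.0
for _ in range(trials):
    noise = [rng.gauss(0.0, v ** 0.5) for v in variances]
    average += sum(p * p for p in project(noise, boundaries)) / trials

theory = expected_projected_noise(variances, boundaries)
```

Here `theory` equals $(1 + 1)/2 + (4 + 4 + 4 + 0.25)/4 = 4.0625$, and `average` should be close to it up to Monte Carlo error.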

B.2 Proof of Proposition 2 

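Let us note 

$$S_m := \langle \mu^* - \mu^*_m, \varepsilon \rangle = \sum_{i=1}^{n} Z_i \quad \text{with} \quad Z_i := \langle (\mu^* - \mu^*_m)_i, \varepsilon_i \rangle_{\mathcal{H}} .$$

The $Z_i$ are independent and centered, so Eq. (26)-(27) in Lemma 6 below (which requires 
assumption (Db)) show the conditions of Bernstein's inequality are satisfied (see Theorem 9). 
Therefore, for every $x \ge 0$, with probability at least $1 - 2e^{-x}$, 

$$|S_m| \le \sqrt{2 v_{\max} x}\, \|\mu^* - \mu^*_m\| + \frac{4 M^2 x}{3} \le \theta \|\mu^* - \mu^*_m\|^2 + \left( \frac{v_{\max}}{2\theta} + \frac{4 M^2}{3} \right) x$$

for every $\theta > 0$, using $2ab \le \theta a^2 + \theta^{-1} b^2$. ■ 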
A key argument in the proof is the following lemma. 

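Lemma 6 For every $m \in \mathcal{M}_n$, if (Db) holds true (hence also (Vmax)), the following 
holds with probability one: 

$$\forall i \in \{1, \ldots, n\}, \quad \|\mu^*_i\|_{\mathcal{H}} \le M , \quad \|\varepsilon_i\|_{\mathcal{H}} \le 2M \tag{25}$$

$$\text{and} \quad \|(\mu^* - \mu^*_m)_i\|_{\mathcal{H}} \le 2M \quad \text{so that} \quad |Z_i| \le 4 M^2 . \tag{26}$$

$$\text{In addition,} \quad \sum_{i=1}^{n} \operatorname{Var}(Z_i) \le v_{\max} \|\mu^* - \mu^*_m\|^2 . \tag{27}$$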



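Proof [of Lemma 6] First, remark that for every $i$, 

$$v_i = \mathbb{E}\left[ \|\varepsilon_i\|_{\mathcal{H}}^2 \right] = \mathbb{E}[k(X_i, X_i)] - \|\mu^*_i\|_{\mathcal{H}}^2 \ge 0$$

so that with (Db) 

$$\|\mu^*_i\|_{\mathcal{H}}^2 \le \mathbb{E}[k(X_i, X_i)] \le M^2 ,$$

which proves the first bound in Eq. (25). As a consequence, by the triangular inequality, 

$$\|\varepsilon_i\|_{\mathcal{H}} \le \|Y_i\|_{\mathcal{H}} + \|\mu^*_i\|_{\mathcal{H}} \le 2M ,$$

that is, the second inequality in Eq. (25) holds true. 

Let us now define, for every $i \in \{1, \ldots, n\}$, $\lambda(i)$ as the unique element of $m$ such that 
$i \in \lambda(i)$. Then, 

$$(\mu^*_m)_i = \frac{1}{\operatorname{Card}(\lambda(i))} \sum_{j \in \lambda(i)} \mu^*_j$$

so that the triangular inequality and Eq. (25) imply 

$$\|(\mu^* - \mu^*_m)_i\|_{\mathcal{H}} \le \sup_{j \in \lambda(i)} \|\mu^*_i - \mu^*_j\|_{\mathcal{H}} \le \sup_{1 \le j, k \le n} \|\mu^*_j - \mu^*_k\|_{\mathcal{H}} \le 2 \sup_{1 \le j \le n} \|\mu^*_j\|_{\mathcal{H}} \le 2M ,$$

that is, the first part of Eq. (26) holds true. The second part of Eq. (26) directly follows 
from Cauchy-Schwarz inequality. For proving Eq. (27), we remark that 

$$\begin{aligned}
\mathbb{E}\left[ Z_i^2 \right] &\le \|(\mu^* - \mu^*_m)_i\|_{\mathcal{H}}^2\, \mathbb{E}\left[ \|\varepsilon_i\|_{\mathcal{H}}^2 \right] && \text{by Cauchy-Schwarz inequality} \\
&\le \|(\mu^* - \mu^*_m)_i\|_{\mathcal{H}}^2\, v_{\max} && \text{by (Vmax)} ,
\end{aligned}$$

so that 

$$\sum_{i=1}^{n} \operatorname{Var}(Z_i) \le v_{\max} \|\mu^* - \mu^*_m\|^2 .$$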

B.3 Proof of Proposition 3 

This proof is inspired by Sauve (2009), where a similar concentration inequality was 
needed for real-valued data, in the context of regression with piecewise polynomial 
estimators. As in our setting, Talagrand's inequality was not precise enough in the setting of 
Sauve (2009). 

Let us define 

$$T_m := \|\Pi_m \varepsilon\|^2 = \sum_{\lambda \in m} T_\lambda \quad \text{with} \quad T_\lambda := \frac{1}{n_\lambda} \Big\| \sum_{j \in \lambda} \varepsilon_j \Big\|_{\mathcal{H}}^2$$

according to Eq. (24). Now, remark that $(T_\lambda)_{\lambda \in m}$ is a sequence of independent real-valued 
random variables, so we can get a concentration inequality for $T_m$ via Bernstein's inequality, 
as long as $T_\lambda$ satisfies some moment conditions (see Theorem 9). The rest of the proof will 
consist in showing such moment bounds, by using Pinelis-Sakhanenko deviation inequality 
(Proposition 10). 

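First, we showed in the proof of Eq. (10) that for every $\lambda \in m$, $\mathbb{E}[T_\lambda] = v_\lambda$. Second, for 
every $q \ge 2$, 

$$\mathbb{E}\left[ T_\lambda^q \right] = \frac{1}{n_\lambda^q}\, \mathbb{E}\left[ \Big\| \sum_{k \in \lambda} \varepsilon_k \Big\|_{\mathcal{H}}^{2q} \right] = \frac{1}{n_\lambda^q} \int_0^{2 n_\lambda M} 2q\, x^{2q - 1}\, \mathbb{P}\left( \Big\| \sum_{k \in \lambda} \varepsilon_k \Big\|_{\mathcal{H}} \ge x \right) dx$$

since for every $k$, $\|\varepsilon_k\|_{\mathcal{H}} \le 2M$ almost surely by Lemma 6, using (Db). Using again that 
$\|\varepsilon_k\|_{\mathcal{H}} \le 2M$ a.s., we get that for every $p \ge 2$ and $\lambda \in m$, 

$$\sum_{k \in \lambda} \mathbb{E}\left[ \|\varepsilon_k\|_{\mathcal{H}}^p \right] \le (2M)^{p - 2} \sum_{k \in \lambda} \mathbb{E}\left[ \|\varepsilon_k\|_{\mathcal{H}}^2 \right] \le \frac{p!}{2} \left( \sum_{k \in \lambda} v_k \right) \left( \frac{2M}{3} \right)^{p - 2} ,$$

that is, the assumption of Pinelis-Sakhanenko deviation inequality (see Proposition 10) 
holds true with $c = 2M/3$ and $\sigma^2 = \sum_{k \in \lambda} v_k$. Therefore, using (Vmin), we get 

$$\begin{aligned}
\mathbb{E}\left[ T_\lambda^q \right] &\le \frac{1}{n_\lambda^q} \int_0^{2 n_\lambda M} 2q\, x^{2q - 1} \times 2 \exp\left( - \frac{x^2}{2 \left( n_\lambda v_\lambda + \frac{2 M x}{3} \right)} \right) dx \\
&\le \frac{4q}{n_\lambda^q} \int_0^{2 n_\lambda M} x^{2q - 1} \exp\left( - \frac{x^2}{2 n_\lambda v_\lambda \left( 1 + \frac{4 c_{\min}}{3} \right)} \right) dx \\
&\le 2 \times (q!) \left[ 2 v_\lambda \left( 1 + \frac{4 c_{\min}}{3} \right) \right]^q
\end{aligned}$$

since for every $q \ge 1$, 

$$\int_0^{+\infty} u^{2q - 1} \exp(-u^2/2)\, du = 2^{q - 1} (q - 1)! .$$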
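
The integral identity used here, $\int_0^{+\infty} u^{2q-1} \exp(-u^2/2)\, du = 2^{q-1} (q-1)!$, follows from the substitution $t = u^2/2$, which turns the integral into $2^{q-1} \Gamma(q)$. As a sanity check, the Python sketch below compares a midpoint-rule approximation with this closed form; the truncation point and the number of steps are arbitrary choices made for the example.

```python
import math


def gaussian_moment_integral(q, upper=30.0, steps=60000):
    """Midpoint-rule approximation of the integral of u^(2q-1) exp(-u^2/2) on [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += u ** (2 * q - 1) * math.exp(-u * u / 2.0)
    return total * h


def closed_form(q):
    """Closed form 2^(q-1) (q-1)! of the same integral over [0, +infinity)."""
    return 2 ** (q - 1) * math.factorial(q - 1)


# Relative error between the numerical value and the closed form, for q = 1, ..., 5.
relative_errors = [
    abs(gaussian_moment_integral(q) - closed_form(q)) / closed_form(q)
    for q in range(1, 6)
]
```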



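Finally summing over $\lambda \in m$, it comes (using in particular that $c_{\min} \ge 1$) 

$$\begin{aligned}
\sum_{\lambda \in m} \mathbb{E}\left[ T_\lambda^q \right] &\le \frac{q!}{2} \times 4 \sum_{\lambda \in m} \left[ 2 v_\lambda \left( 1 + \frac{4 c_{\min}}{3} \right) \right]^q \\
&\le \frac{q!}{2} \times 4 \sum_{\lambda \in m} \left( \frac{14\, c_{\min} v_\lambda}{3} \right)^q \\
&\le \frac{q!}{2} \left( 87.5\, c_{\min}^2 v_{\max} \sum_{\lambda \in m} v_\lambda \right) \left[ 5 c_{\min} v_{\max} \right]^{q - 2} ,
\end{aligned}$$

that is, condition (35) of Bernstein's inequality holds with 

$$v = 87.5\, v_{\max} c_{\min}^2 \sum_{\lambda \in m} v_\lambda \quad \text{and} \quad c = 5 c_{\min} v_{\max} .$$

Therefore, Bernstein's inequality (see Theorem 9) shows that for every $x \ge 0$, with probability 
at least $1 - 2e^{-x}$, 

$$\begin{aligned}
\left| T_m - \mathbb{E}[T_m] \right| &\le \sqrt{175\, v_{\max} c_{\min}^2\, x \sum_{\lambda \in m} v_\lambda} + 5 v_{\max} c_{\min} x \\
&\le \theta \sum_{\lambda \in m} v_\lambda + \left( \frac{44\, c_{\min}^2}{\theta} + 5 c_{\min} \right) v_{\max} x \\
&\le \theta\, \mathbb{E}[T_m] + \frac{49\, c_{\min}^2 v_{\max} x}{\theta}
\end{aligned}$$

for every $\theta \in (0, 1]$, using also that $c_{\min} \ge 1$. ■ 

Remark 7 Let us emphasize that the classical approach for proving concentration results 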
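on $\|\Pi_m \varepsilon\|$ when $\varepsilon$ is bounded would not yield a result as precise as Proposition 3. Using for 
instance Talagrand's inequality (see Bousquet, 2002), we get 

$$\|\Pi_m \varepsilon\| = \sup_{f \in \mathcal{H}^n, \|f\| = 1} \left| \langle f, \Pi_m \varepsilon \rangle \right| = \sup_{f \in \mathcal{H}^n, \|f\| = 1} \left| \sum_{i=1}^{n} \langle f_i, (\Pi_m \varepsilon)_i \rangle_{\mathcal{H}} \right|$$

and remark that for every $f \in \mathcal{H}^n$, the variables $\langle f_i, (\Pi_m \varepsilon)_i \rangle_{\mathcal{H}}$ are real-valued, independent, 
centered and bounded. Then, instead of a main deviation term $\sqrt{2vx}$ with $v = \mathbb{E}[\|\Pi_m \varepsilon\|^2]$ as 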
in Eq. (13), we would have v of order Yl^=i^^'9 f&H^ ,\\f\\=i'^[{fii (nm£)j)^] which is much 
larger than E[||nme|P] = sup^g^n ^"^^^ E[(/j, In the remainder of the 

proof, we do need a main deviation term $\sqrt{2vx}$ with $v$ proportional to $\mathbb{E}[\|\Pi_m \varepsilon\|^2]$, which is 
why we have to prove a result like Proposition 3. 



B.4 Proof of a general model selection theorem 

As sketched in Section 4.4, we first prove a general oracle inequality from which Theorems 1 
and 4 are corollaries. 

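Theorem 8 Let $\mathcal{M}'_n \subset \mathcal{M}_n$ and $\widehat{m}$ be some model selection procedure satisfying 

$$\widehat{m} \in \operatorname{argmin}_{m \in \mathcal{M}'_n} \left\{ \frac{1}{n} \|Y - \widehat{\mu}_m\|^2 + \operatorname{pen}(m) \right\} . \tag{28}$$

Assume that (Db), (Vmin), and (Vmax) hold true. Let $(x_m)_{m \in \mathcal{M}'_n}$ be any collection of 
nonnegative numbers and assume that 

$$\forall m \in \mathcal{M}'_n, \quad \operatorname{pen}(m) \ge \frac{2}{n} \sum_{\lambda \in m} v_\lambda + r(x_m, \theta) , \tag{29}$$

with $r(x_m, \theta) := 213\, c_{\min}^2 v_{\max} x_m (\theta n)^{-1}$. Then, an event $\Omega((x_m))$ exists such that 
$\mathbb{P}(\Omega((x_m))) \ge 1 - 4 \sum_{m \in \mathcal{M}'_n} e^{-x_m}$ and, on $\Omega((x_m))$, for every $\theta \in (0, 1/8)$, 

$$\frac{1}{n} \|\mu^* - \widehat{\mu}_{\widehat{m}}\|^2 \le \frac{1}{1 - 4\theta} \inf_{m \in \mathcal{M}'_n} \left\{ (1 + 4\theta) \frac{1}{n} \|\mu^* - \widehat{\mu}_m\|^2 + \operatorname{pen}(m) - \frac{2}{n} \sum_{\lambda \in m} v_\lambda + r(x_m, \theta) \right\} .$$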

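Proof [of Theorem 8] The first step is to combine Eq. (9), Eq. (11), Proposition 2 and 
Proposition 3. We get that for every $x \ge 0$, an event $\Omega_m(x)$ of probability at least $1 - 4e^{-x}$ 
exists on which, for every $\theta > 0$, 

$$\begin{aligned}
\left| \operatorname{pen}_{\mathrm{id}}(m) + \frac{1}{n} \|\varepsilon\|^2 - \frac{2}{n} \sum_{\lambda \in m} v_\lambda \right| &\le \frac{2\theta}{n}\, \mathbb{E}\left[ \|\mu^* - \widehat{\mu}_m\|^2 \right] + \frac{98\, c_{\min}^2 v_{\max} x}{\theta n} + 2 \left( \frac{v_{\max}}{2\theta} + \frac{4 M^2}{3} \right) \frac{x}{n} \\
&\le \frac{2\theta}{n}\, \mathbb{E}\left[ \|\mu^* - \widehat{\mu}_m\|^2 \right] + \left( \frac{98\, c_{\min}^2 + 1}{\theta} + \frac{8 c_{\min}}{3} \right) \frac{v_{\max} x}{n}
\end{aligned}$$

where we used (Vmin). 

Using again Proposition 3 in combination with Eq. (22), we get that on $\Omega_m(x)$, 

$$\forall \theta \in (0, 1), \quad \mathbb{E}\left[ \|\mu^* - \widehat{\mu}_m\|^2 \right] \le (1 - \theta)^{-1} \left[ \|\mu^* - \widehat{\mu}_m\|^2 + \theta^{-1} 49\, c_{\min}^2 v_{\max} x \right] . \tag{30}$$

Therefore, on $\Omega_m(x)$, for every $\theta \in (0, 1/8)$, 

$$\begin{aligned}
\left| \operatorname{pen}_{\mathrm{id}}(m) + \frac{1}{n} \|\varepsilon\|^2 - \frac{2}{n} \sum_{\lambda \in m} v_\lambda \right| &\le \frac{2\theta}{(1 - \theta) n} \|\mu^* - \widehat{\mu}_m\|^2 + \left( \frac{2 \left( 1 + (1 - \theta)^{-1} \right) 49\, c_{\min}^2 + 1}{\theta} + \frac{8 c_{\min}}{3} \right) \frac{v_{\max} x}{n} && (31) \\
&\le \frac{4\theta}{n} \|\mu^* - \widehat{\mu}_m\|^2 + r_0(x, \theta) && (32)
\end{aligned}$$

where 

$$r_0(x, \theta) := \left( 210\, c_{\min}^2 + \frac{8 c_{\min} \theta}{3} + 1 \right) \frac{v_{\max} x}{\theta n} \le \frac{213\, c_{\min}^2 v_{\max} x}{\theta n} = r(x, \theta) .$$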

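Then, let $\Omega((x_m)) := \bigcap_{m \in \mathcal{M}'_n} \Omega_m(x_m)$. By the union bound, $\mathbb{P}(\Omega((x_m))) \ge 1 - 4 \sum_{m \in \mathcal{M}'_n} e^{-x_m}$. 

Now, by definition (28) of $\widehat{m}$, for every $m \in \mathcal{M}'_n$, 

$$\frac{1}{n} \|\mu^* - \widehat{\mu}_{\widehat{m}}\|^2 + \left[ \operatorname{pen}(\widehat{m}) - \operatorname{pen}_{\mathrm{id}}(\widehat{m}) \right] \le \frac{1}{n} \|\mu^* - \widehat{\mu}_m\|^2 + \left[ \operatorname{pen}(m) - \operatorname{pen}_{\mathrm{id}}(m) \right] . \tag{33}$$

Therefore, on $\Omega((x_m))$, combining Eq. (32), (33) and the condition satisfied by $\operatorname{pen}(m)$, we 
get the result for all $\theta \in (0, 1/8)$. ■ 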



B.5 Proof of Theorem 1 

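We apply Theorem 8 with $\mathcal{M}'_n = \mathcal{M}_n$ and $x_m = D_m (\log(2) + 1 + \log(n / D_m)) + \log 4 + x$. 
Indeed, the probability of the complement of $\Omega((x_m))$ then is upper bounded by 

$$\begin{aligned}
4 \sum_{m \in \mathcal{M}_n} e^{-x_m} &= e^{-x} \sum_{1 \le D \le n} \operatorname{Card}\{ m \in \mathcal{M}_n \, / \, D_m = D \} \exp\left( -D \left( \log(2) + 1 + \log\left( \frac{n}{D} \right) \right) \right) \\
&= e^{-x} \sum_{1 \le D \le n} \binom{n - 1}{D - 1} \exp\left( -D \left( \log(2) + 1 + \log\left( \frac{n}{D} \right) \right) \right) \\
&\le e^{-x} \sum_{1 \le D \le n} \exp(-D \log(2)) \le e^{-x} \sum_{D \ge 1} 2^{-D} = e^{-x} ,
\end{aligned}$$

where we used $\binom{n-1}{D-1} \le \binom{n}{D} \le (en/D)^D$. 

Furthermore, we get for every $\theta \in (0, 1/8)$ that 

$$\begin{aligned}
\frac{2}{n} \sum_{\lambda \in m} v_\lambda + r(x_m, \theta) &\le \left[ 2 + \theta^{-1} 213\, c_{\min}^2 \left( \log(2) + 1 + \log\left( \frac{n}{D_m} \right) \right) \right] \frac{D_m v_{\max}}{n} + \frac{213\, c_{\min}^2 v_{\max} (\log 4 + x)}{\theta n} \\
&\le \frac{D_m v_{\max}}{n} \left( C_1 + C_2 \log\left( \frac{n}{D_m} \right) \right) + \frac{C_3}{n}
\end{aligned}$$

with 

$$C_1 = C_1(\theta) = 361\, \theta^{-1} c_{\min}^2 , \qquad C_2 = C_2(\theta) = 213\, \theta^{-1} c_{\min}^2 , \qquad C_3 = C_3(x, \theta) = C_2(\theta)\, v_{\max} (\log 4 + x) .$$

Note that $C_3(x, \theta)$ is an additive term independent of $m$, so it can be safely removed 
from the penalty. 

Finally, taking $\theta = 1/12$ yields the result as long as $C_1 / c_{\min}^2$ and $C_2 / c_{\min}^2$ are larger than 
some numerical constant $L_1 = 4332$. ■ 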
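
The counting step of this proof can be checked numerically. The Python sketch below evaluates the sum over $D$ of the number of segmentations with $D$ segments, $\binom{n-1}{D-1}$, weighted by $\exp(-D(\log 2 + 1 + \log(n/D)))$, and confirms it stays below one for a few values of $n$. It also checks the arithmetic behind $L_1 = 4332$ under the assumption that $C_1 = 361\, \theta^{-1} c_{\min}^2$, which is what $4332 = 12 \times 361$ suggests for $\theta = 1/12$. The function name and the tested values of $n$ are choices made for the example.

```python
import math


def union_bound_sum(n):
    """Sum over D of C(n-1, D-1) * exp(-D * (log 2 + 1 + log(n / D)))."""
    total = 0.0
    for D in range(1, n + 1):
        # Work in log scale so that large binomial coefficients do not overflow.
        log_term = math.log(math.comb(n - 1, D - 1)) - D * (
            math.log(2) + 1 + math.log(n / D)
        )
        total += math.exp(log_term)
    return total


sums = {n: union_bound_sum(n) for n in (5, 50, 500)}

# Constant needed so that 2 + 213 * (log 2 + 1) / theta <= C1 / theta when theta <= 1/8.
c1_needed = 213 * (math.log(2) + 1) + 2 / 8
```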

B.6 Proof of Theorem 4 

First note that Theorem 8 does not rely on (Vmin) but only uses that (Vmin') holds true. 
Then, let us take $x_m = x + \log(4 \operatorname{Card}(\mathcal{M}'_n))$ for every $m \in \mathcal{M}'_n$ with $x \ge 0$ in Theorem 8. 
First, we get 

$$\mathbb{P}(\Omega((x_m))) \ge 1 - 4 \sum_{m \in \mathcal{M}'_n} e^{-x}\, e^{-\log(4 \operatorname{Card}(\mathcal{M}'_n))} = 1 - e^{-x} .$$

Second, the condition (29) can be reduced to Eq. (18) since the term $r(x_m, \theta)$ no longer 
depends on $m$. Therefore, it can be removed without changing the penalization procedure. 



B.7 Proof of Corollary 5 

We start from Theorem 4, denoting by $\Omega$ the event on which the oracle inequality holds 
true. 

First, assumption (Vmax) guarantees the penalty defined by Eq. (19) satisfies Eq. (18). 
Then, using assumption (Vmin'), we get 



min j 

AGm. 

A 



A 



1 2Drr..V 



m ''mm 



< 



1 ) 2 E 

AGm 

Therefore, on $\Omega$, since Eq. (30) holds true with $x$ replaced by $x + \log(4 \operatorname{Card}(\mathcal{M}'_n))$, we get 



2DmA - 2 ^ ^;A < (1 



Asm 

1^* - /Imir + ^~^49c^inWmax {x + log(4Card(A^;,)) ) 



A 



Vrr] i T 



(34) 



So, on $\Omega$, for every $\theta \in (0, 1/8)$, 



n 



I -AO 



+ „ [x 

no 



meM' I n 



+ log(4Card(7W;j)) 



426 + 



49 



(l-0)(l-40) \ v, 



A 



1 



We get the result by taking $\theta = \theta_n = (\log(n))^{-1}$ since for $n$ larger than some numerical 
constant, 



426 + 



l-40„ 
49 



(1 - 0„)(1 - 4^™) V^: 



^min 

< max <! 426 



log(n) 



49 



(l-0„)(l-40„) j 



A 426A 
< 



and — —2 S ^c^ir 

where we used (Vmin), $A \ge v_{\max} \ge v_{\min}$, and $v_{\max} \le M^2$. 



Appendix C. Some useful results 

This section collects a few results that are used throughout the paper. 

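Theorem 9 (Bernstein's inequality, see Proposition 2.9 in (Massart, 2007)) 
Let $X_1, \ldots, X_n$ be independent real valued random variables. Assume there exist positive 
constants $v$ and $c$ satisfying for every $k \ge 2$ 

$$\sum_{i=1}^{n} \mathbb{E}\left[ |X_i|^k \right] \le \frac{k!}{2}\, v\, c^{k - 2} . \tag{35}$$

Then for every $x > 0$, 

$$\mathbb{P}\left( \sum_{i=1}^{n} (X_i - \mathbb{E}[X_i]) \ge \sqrt{2 v x} + c x \right) \le e^{-x} .$$

In particular, if for every $i$, $|X_i| \le 3c$ almost surely, Eq. (35) holds true with $v = \sum_{i=1}^{n} \operatorname{Var}(X_i)$. 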
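
As an informal numerical illustration of the deviation level $\sqrt{2vx} + cx$, the Python sketch below simulates sums of independent centered variables uniform on $[-1, 1]$, for which $|X_i| \le 1 = 3c$ and $\operatorname{Var}(X_i) = 1/3$, and compares the empirical frequency of exceeding that level with $e^{-x}$. The distribution, the sample size, the value of $x$ and the number of trials are arbitrary choices made for the example.

```python
import math
import random


def bernstein_threshold(v, c, x):
    """Deviation level sqrt(2 v x) + c x appearing in Bernstein's inequality."""
    return math.sqrt(2.0 * v * x) + c * x


rng = random.Random(1)
n, x, trials = 50, 2.0, 20000

# Uniform variables on [-1, 1]: |X_i| <= 1 = 3c and Var(X_i) = 1/3.
c = 1.0 / 3.0
v = n / 3.0
threshold = bernstein_threshold(v, c, x)

exceed = 0
for _ in range(trials):
    total = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
    if total >= threshold:
        exceed += 1
frequency = exceed / trials
```

The observed `frequency` should fall below $e^{-x}$, and in this example it is typically much smaller, since the inequality is not tight for such well-behaved variables.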

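Proposition 10 (Pinelis and Sakhanenko (1986), Corollary 1) Let $X_1, \ldots, X_n$ be $n$ 
independent and identically distributed random variables with values in some Hilbert space 
$\mathcal{H}$. Assume the $X_i$ are centered and that constants $\sigma^2, c > 0$ exist such that for every $p \ge 2$, 

$$\sum_{i=1}^{n} \mathbb{E}\left[ \|X_i\|_{\mathcal{H}}^p \right] \le \frac{p!}{2}\, \sigma^2 c^{p - 2} .$$

Then, for every $x > 0$, 

$$\mathbb{P}\left( \Big\| \sum_{i=1}^{n} X_i \Big\|_{\mathcal{H}} > x \right) \le 2 \exp\left( - \frac{x^2}{2 (\sigma^2 + c x)} \right) .$$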



Appendix D. Additional simulation results 

This section gathers some additional results concerning the experiments of Section 6.1. 

    Kernel bandwidth    Risk ratio
    h = 0.1             3.56 ± 0.17
    h = 1.0             3.06 ± 0.15
    adaptive h          1.61 ± 0.15

Table 2: Synthetic data. Risk ratio $\mathbb{E}[\mathcal{R}(\widehat{\mu}_{\widehat{m}}) / \inf_{m \in \mathcal{M}_n} \{\mathcal{R}(\widehat{\mu}_m)\}]$ for three bandwidth 
choices. 




Figure 3: Real data experiment, audio stream: Distribution of the estimated number of 
change-points $\widehat{D} - 1$. 